IE510 Term Paper: Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz Algorithm

Authors

  • Tanmay Gupta
  • Aditya Deshpande
Abstract

In this paper, we study the convergence properties of stochastic gradient descent (SGD) as described in Needell et al. [2]. The function to be minimized with SGD is assumed to be strongly convex, and its component gradients are assumed to be Lipschitz continuous. First, we discuss the convergence bound for standard SGD obtained by Needell et al. [2], which improves on the earlier bound of Bach and Moulines [1]. Then, we show that this bound can be improved further if SGD is performed with importance (weighted) sampling instead of uniform sampling. Finally, we study two applications, logistic regression and the Kaczmarz algorithm, and demonstrate the faster convergence obtained from SGD with weighted (or partially weighted) sampling.

1 Stochastic Gradient Descent

In stochastic gradient descent (SGD), we minimize a function $F(w)$ using stochastic gradients in the update rule at each step. The stochastic gradients $g$ are such that their expectation is the gradient of $F(w)$, i.e. $\mathbb{E}[g] = \nabla F(w)$. In particular, if $F(w) = \mathbb{E}_{i \sim \mathcal{D}}[f_i(w)]$, then $g = \nabla f_i(w)$. The update rule is

$$w_{k+1} = w_k - \gamma \nabla f_{i_k}(w_k), \qquad (1)$$

where $\gamma$ is the step size and $i_k$ is drawn i.i.d. from some distribution $\mathcal{D}$. In this paper, we study the convergence of SGD on the function $F$ under the following assumptions:

1. Each $f_i$ is continuously differentiable and its gradient is Lipschitz continuous with constant $L_i$, i.e.
   $$\|\nabla f_i(w_1) - \nabla f_i(w_2)\|_2 \le L_i \|w_1 - w_2\|_2.$$

2. $F$ is strongly convex with parameter $\mu$, i.e.
   $$F(w_1) \ge F(w_2) + \langle \nabla F(w_2),\, w_1 - w_2 \rangle + \frac{\mu}{2} \|w_1 - w_2\|_2^2,$$
   or, equivalently,
   $$\langle w_1 - w_2,\, \nabla F(w_1) - \nabla F(w_2) \rangle \ge \mu \|w_1 - w_2\|_2^2.$$
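As a concrete illustration of update rule (1), the sketch below runs SGD with uniform sampling on a least-squares objective $F(w) = \frac{1}{n}\sum_i (\langle a_i, w\rangle - b_i)^2$, whose components satisfy assumption 1 with $L_i = 2\|a_i\|_2^2$. The problem data, step size, and iteration count are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: F(w) = (1/n) * sum_i (a_i . w - b_i)^2,
# so f_i(w) = (a_i . w - b_i)^2 and grad f_i(w) = 2 * a_i * (a_i . w - b_i).
n, d = 200, 10
A = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
b = A @ w_star                    # consistent system, so F(w_star) = 0

gamma = 1e-3                      # illustrative fixed step size
w = np.zeros(d)
for k in range(20000):
    i = rng.integers(n)                    # i_k drawn uniformly from {0, ..., n-1}
    g = 2.0 * A[i] * (A[i] @ w - b[i])     # stochastic gradient with E[g] = grad F(w)
    w = w - gamma * g                      # update rule (1)

print("distance to minimizer:", np.linalg.norm(w - w_star))
```

Because the system here is consistent, the stochastic gradients vanish at the minimizer, so the iterates converge to $w^\ast$ itself rather than to a noise ball.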

Similar Resources

Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm

We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning L/μ (where L is a bound on the smoothness and μ on the strong convexity) to a linear dependence on L/μ. Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling)...
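The reweighting referred to above can be made concrete: draw index $i$ with probability $p_i \propto L_i$ rather than uniformly, and rescale the gradient by $1/(n p_i)$ so that it remains an unbiased estimate of $\nabla F(w)$. The following is a minimal sketch of that idea on the same least-squares setup as in Section 1, assuming the $p_i \propto L_i$ weighting; it is not the paper's exact algorithm or step-size schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same least-squares setup: f_i(w) = (a_i . w - b_i)^2, with L_i = 2 * ||a_i||^2.
n, d = 200, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)

L = 2.0 * np.sum(A * A, axis=1)   # per-component Lipschitz constants
p = L / L.sum()                   # importance sampling distribution, p_i proportional to L_i

gamma = 1e-3
w = np.zeros(d)
for k in range(20000):
    i = rng.choice(n, p=p)                              # weighted draw of the component index
    g = 2.0 * A[i] * (A[i] @ w - b[i]) / (n * p[i])     # rescaling keeps E[g] = grad F(w)
    w = w - gamma * g
```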

Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences

Randomized algorithms that base iteration-level decisions on samples from some pool are ubiquitous in machine learning and optimization. Examples include stochastic gradient descent and randomized coordinate descent. This paper makes progress toward theoretically evaluating the difference in performance between sampling with and without replacement in such algorithms. Focusing on least mean squares...

Rows vs Columns for Linear Systems of Equations - Randomized Kaczmarz or Coordinate Descent?

This paper is about randomized iterative algorithms for solving a linear system of equations Xβ = y in different settings. Recent interest in the topic was reignited when Strohmer and Vershynin (2009) proved the linear convergence rate of a Randomized Kaczmarz (RK) algorithm that works on the rows of X (data points). Following that, Leventhal and Lewis (2010) proved the linear convergence of a ...
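The row-action RK iteration described here is short enough to sketch directly: sample row $i$ of $X$ with probability proportional to $\|x_i\|_2^2$, then project the current iterate onto the hyperplane $\{\beta : \langle x_i, \beta\rangle = y_i\}$. The sketch below assumes a consistent system $X\beta = y$ with synthetic, illustrative data.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 300, 20
X = rng.standard_normal((n, d))
beta_star = rng.standard_normal(d)
y = X @ beta_star                 # consistent overdetermined system

# Randomized Kaczmarz: sample row i with probability ||x_i||^2 / ||X||_F^2,
# then project beta onto the hyperplane { b : <x_i, b> = y_i }.
row_norms_sq = np.sum(X * X, axis=1)
p = row_norms_sq / row_norms_sq.sum()

beta = np.zeros(d)
for k in range(2000):
    i = rng.choice(n, p=p)
    beta = beta + (y[i] - X[i] @ beta) / row_norms_sq[i] * X[i]

print("error:", np.linalg.norm(beta - beta_star))
```

The projection step is exactly the SGD update (1) on $f_i(\beta) = (\langle x_i, \beta\rangle - y_i)^2$ with a row-dependent step size, which is what lets the paper treat RK as SGD with weighted sampling.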

On the Relation Between the Randomized Extended Kaczmarz Algorithm and Coordinate Descent

In this note we compare the randomized extended Kaczmarz (EK) algorithm and randomized coordinate descent (CD) for solving the full-rank overdetermined linear least-squares problem, and prove that CD needs fewer operations to satisfy the same residual-related termination criteria. For general least-squares problems, we show that running CD first to compute the residual and then standard Kaczmarz...


Publication date: 2016